ZSE-VITS: A Zero-Shot Expressive Voice Cloning Method Based on VITS

Abstract

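Voice cloning aims to synthesize a voice with a new speaker's timbre from a small amount of that speaker's speech. Current voice cloning methods, which focus on modeling speaker timbre, can synthesize speech with similar timbres. However, the prosody of these methods is flat, lacking expressiveness and the ability to control the expressiveness of cloned voices. To solve this problem, we propose a novel method, ZSE-VITS (zero-shot expressive VITS), based on the end-to-end speech synthesis model VITS. Specifically, we use VITS as the backbone network and add the speaker recognition model TitaNet as the speaker encoder to realize zero-shot voice cloning. We use explicit prosody information to avoid the effects of speaker information, and we adjust speech prosody directly using prosody prediction and prosody fusion. We widen the pitch distribution of the training datasets using pitch augmentation to improve the generalization ability of the prosody model, and we fine-tune the prosody predictor alone on an emotion corpus to learn prosody prediction for various styles. The objective and subjective evaluations on open datasets show that our method can generate more expressive speech and adjust prosody artificially without affecting the similarity of speaker timbre.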
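
The abstract mentions widening the pitch distribution of the training data through augmentation but does not say how. As a rough illustration only, the sketch below assumes that augmentation means scaling an F0 contour by a random number of semitones. The function names `shift_f0` and `augment_f0`, the `max_semitones` range, and the convention that 0 Hz marks an unvoiced frame are all hypothetical choices and are not taken from the paper.

```python
import random


def shift_f0(f0_hz, semitones):
    """Scale an F0 contour (in Hz) by a number of semitones.

    Frames with F0 == 0 are treated as unvoiced and left at 0
    (an assumed convention, not something stated in the abstract).
    """
    factor = 2.0 ** (semitones / 12.0)  # 12 semitones doubles the frequency
    return [f * factor if f > 0 else 0.0 for f in f0_hz]


def augment_f0(f0_hz, rng, max_semitones=4.0):
    """Draw a random shift in [-max_semitones, max_semitones] and apply it.

    Returns the shifted contour and the shift that was drawn.
    The +/- 4 semitone default is an arbitrary illustrative value.
    """
    shift = rng.uniform(-max_semitones, max_semitones)
    return shift_f0(f0_hz, shift), shift


if __name__ == "__main__":
    rng = random.Random(0)
    contour = [0.0, 110.0, 220.0, 0.0]  # toy contour with unvoiced edges
    shifted, shift = augment_f0(contour, rng)
    print(f"shift = {shift:+.2f} semitones")
    print(shifted)
```

In a real pipeline the waveform itself would also have to be pitch-shifted so that audio and F0 labels stay consistent; this sketch covers only the contour arithmetic.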

Similar resources

Verbmobil Interface Terms (VITs)

This article describes the concepts and the contents of Verbmobil Interface Terms (VITs). In VITs all linguistic information of an utterance relevant for translation is represented. They are used to provide an interface representation between several linguistic and dialog components of the Verbmobil system. Information in VITs is encoded in a recordlike data structure. The fields are variable-f...

Does the use of vaginal-implant transmitters affect neonate survival rate of white-tailed deer Odocoileus virginianus?

We compared survival of neonate white-tailed deer Odocoileus virginianus captured using vaginal-implant transmitters (VITs) and traditional ground searches to determine if capture method affects neonate survival. During winter 2003, 14 adult female radio-collared deer were fitted with VITs to aid in the spring capture of neonates; neonates were captured using VITs (N=14) and traditional ground ...

Direct Expressive Voice Training Based on Semantic Selection

This work aims at creating expressive voices from audiobooks using semantic selection. First, for each utterance of the audiobook an acoustic feature vector is extracted, including iVectors built on MFCC and on F0 basis. Then, the transcription is projected into a semantic vector space. A seed utterance is projected to the semantic vector space and the N nearest neighbors are selected. The sele...

A Unified approach for Conventional Zero-shot, Generalized Zero-shot and Few-shot Learning

Prevalent techniques in zero-shot learning do not generalize well to other related problem scenarios. Here, we present a unified approach for conventional zero-shot, generalized zero-shot and few-shot learning problems. Our approach is based on a novel Class Adapting Principal Directions (CAPD) concept that allows multiple embeddings of image features into a semantic space. Given an image, our ...

Semi-supervised Zero-Shot Learning by a Clustering-based Approach

In some object recognition problems, labeled data may not be available for all categories. Zero-shot learning utilizes auxiliary information (also called signatures) describing each category in order to find a classifier that can recognize samples from categories with no labeled instances. In this paper, we propose a novel semi-supervised zero-shot learning method that works on an embedding s...

Journal

Journal title: Electronics

Year: 2023

ISSN: 2079-9292

DOI: https://doi.org/10.3390/electronics12040820